This documentation gives an introduction and user manual of scDIV
(acronym of the Single Cell RNA sequencing data Demultiplexing using
Interindividual Variations) an R package to use inter-individual
differential co-expression patterns for demultiplexing the pooled
samples without any extra experimental steps.
if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
library(devtools)
install_github("isarnassiri/scDIV")All dependency packages automatically will be downloaded, installed and loaded from CRAN-like repositories. We only tested scDIV in the R version 4.1.3 (2022-03-10) environment. You need to have root permission for this distribution, including the installation of any package.
You can find sample input files in
system.file("extdata", package = "scDIV") folder.
cellsnp-lite is used to pileup the expressed alleles in single-cell data, which can be directly used for donor deconvolution in multiplexed single-cell RNA-seq data, which assigns cells to donors without genotyping reference (LINK).
cellsnp-lite gets bam file and list of barcodes as variable inputs, a Variant Call Format (vcf) file listing all candidate SNPs (regionsVCF) as backend input variable:
cellsnp-lite -s possorted_genome_bam.bam -b barcodes.tsv.gz -O FOLDER-NAME -R regionsVCF -p 22 --minMAF 0.05 --minCOUNT 10 --gzip cellsnp-lite generates a vcf file including called genetic variants as follows (Figure 1):
| Figure 1. Example of vcf file generated by cellsnp-lite. |
We use Vireo (Variational Inference for Reconstructing Ensemble Origin) for donor deconvolution using expressed SNPs in multiplexed scRNA-seq data (LINK).
Vireo gets variants info file provided by cellsnp-lite as an input:
vireo -c input-vcf-file -o output-folder --randSeed 2 -N Number-of-donors -t GP We use “donor_ids.tsv” file from outputs of Vireo for downstream analysis (Figure 2).
| Figure 2. Example of output from Vireo. |
Raw count data from 10X CellRanger (outs/read_count.csv) or other
single-cell experiments has the gene as a row (the gene name should be
the human or mouse Ensembl gene ID) and the cell as a column. You can
convert an HDF5 Feature-Barcode Matrix (LINK)
to a gene-cell count matrix using the cellranger mat2csv (LINK)
command provided by 10Xgenomics. The cells in the
read_count.csv file are from the filtered feature-barcode
matrix generated by the cell ranger.
A filtered feature-barcode matrix generated by the cell ranger can be converted from HDF5 feature-barcode matrix to a gene-cell count matrix (read_count.csv) using the cellranger mat2csv (command provided by 10Xgenomics) as follows:
cellranger mat2csv filtered_feature_bc_matrix.h5 read_count.csvYou can see an example of gene-cell count matrix in figure 3:
| Figure 3. Example of gene-cell count matrix file generated by mat2csv. Each column represent a borcode (cell). |
In this step, we start to use a function from scDIV package. We use
scQC() function to identify potential doublet cells based
on the local density of simulated doublet expression profiles as
follows:
library("scDIV")
csQCEAdir <- system.file("extdata", package = "scDIV")
scQC( InputDir = csQCEAdir ) You can find the results in the QC/ folder under the title of ‘Cells_passed_QC_noDoublet.txt’, including a list of selected barcodes (cells).
The GeneExpressionRecovery() function uses SAVER
(single-cell analysis via expression recovery), an expression recovery
method for unique molecule index (UMI)-based scRNA-seq data to provide
accurate expression estimates for all genes in a scRNA-seq profile.
library("scDIV")
csQCEAdir <- system.file("extdata", package = "scDIV")
Donors='donor6_donor2'
FC='FAI5649A17'
GeneExpressionRecovery( InputDir = csQCEAdir, Donors = Donors, FC = FC )You can find the results in the SAVER/ folder with ‘AssignedCells.txt’ and ‘AllCells.txt’ extensions. You can see an example of transformed gene-cell count matrix in figure 4:
| Figure 4. Example of gene-cell count matrix file generated by SAVER. |
The IDCA() function uses correlation coefficients and
performed Inter-individual Differential gene Correlation Analysis (IDCA)
for two donors (D1 and D2) and genes (G1 and G2).
library("scDIV")
ERP = "donor6_donor2_FAI5649A17_AssignedCells.txt"
Donors='donor6_donor2'
FC='FAI5649A17'
InputDir = system.file("extdata", package = "scDIV")
IDCA( InputDir, Donors, FC, ERP, TEST = T )You can find the results in the IDCA_Analysis/ folder with “IDCA.txt”
extention. You can see an example of IDCA() output in
figure 5:
| Figure 5. Top differentially correlated pairs as well as their differential correlation statistics and correlation category (e.g., +/+). AIC and TPrate stands for “Akaike information criterion” and “true positive rate” which are indicators of gaussian mixture model clustering performance to replicate the cluster of cells per donor assigned by genetics demultiplexing. |
The IDCAvis() function visualizes the outputs of IDC
analysis.
library("scDIV")
InputDir = system.file("extdata", package = "scDIV")
IDCAvis( InputDir )You can find the results in the IDCA_Analysis/IDCA_Plots/ as pdf file(s) (Figure 6).
Figure 6. Example of output from
IDCAvis() function. |
The EADDonorPair() function uses inter-individual
differential co-expression patterns for demultiplexing per donor
pair.
library("scDIV")
InputDir = system.file("extdata", package = "scDIV")
EADDonorPair( InputDir )You can find the results in the IDCA_Analysis/Expression_Aware_Cell_Assignment/ folder called “Expression_Aware_Cell_Assignment.txt” (Figure 7).
Figure 7. Example of output from
EADDonorPair() function. For an indicated pair of the
donors (columns Donor1 and Donor2), EADDonorPair() uses a
top differentially correlated gene pair (e.g., XIST and RPS27) and
gaussian mixture model clustering to assign each cell (BARCODE column)
to a donor (predicted_clusters column). In addition, we append the
results of genetic demultiplexing (donor_id, prob_max, prob_doublet,
n_vars, best_singlet, best_doublet) and flow cell ID (FC). |
The EADPoolSmaple() function uses inter-individual
differential co-expression patterns for demultiplexing the pooled
samples:
library("scDIV")
InputDir = system.file("extdata", package = "scDIV")
EADPoolSmaple( InputDir )You can find the results in the IDCA_Analysis/Expression_Aware_Cell_Assignment/ folder called “Results_Expression_Aware_Cell_Assignment.txt” and “Summary.txt”.
For instance as you can see in figure 8, for an indicated cell (BARCODE column) (e.g., AAACCTGGTTTGACTG-1), we consider all possibilities, and most of the time the cell is assigned to the donor 7 (predicted_clusters column). We confirm that the cell belongs to donor-7 (EA_Assignment column) if we successfully assign it to donor-7 (NumAssignedtoDonor column) for an equal or greater number of all pairs of donors (NumDonorPairs column) minus 1.
Figure 8. Example of output from
EADPoolSmaple() function. Please see text for more
details. |
Isar Nassiri, Benjamin Fairfax, Andrew J Kwok, Aneesha Bhandari, Katherine Bull, Angela Lee, Yanxia Wu, Julian Knight, David Buck, Paolo Piazza. Demultiplexing of Single Cell RNA Sequencing Data using Interindividual Variation in Gene Expression.